Abstract

We investigate how in complex systems the eigenpairs of the matrices derived 
from the correlations of multichannel observations reflect the cluster structure of 
the underlying networks. For this we use daily return data from the NYSE and focus
specifically on the spectral properties of weight $W_{ij} = |C_{ij}| - \delta_{ij}$ and diffusion
matrices $D_{ij} = W_{ij}/s_j - \delta_{ij}$, where $C_{ij}$ is the correlation matrix and $s_j = \sum_i W_{ij}$ the
strength of node j. The eigenvalues (and corresponding eigenvectors) of the weight 
matrix are ranked in descending order. In accord with the earlier observations the 
first eigenvector stands for a measure of the market correlations. Its components are 
to first approximation equal to the strengths of the nodes and there is a second or- 
der, roughly linear, correction. The high ranking eigenvectors, excluding the highest 
ranking one, are usually assigned to market sectors and industrial branches. Our 
study shows that both for weight and diffusion matrices the eigenpair analysis is 
not capable of easily deducing the cluster structure of the network without a priori 
knowledge. In addition we have studied the clustering of stocks using the asset graph 
approach with and without spectrum based noise filtering. It turns out that asset 
graphs are quite insensitive to noise and there is no sharp percolation transition as a 
function of the ratio of bonds included, thus no natural threshold value for that ratio 
seems to exist. We suggest that these observations can be of use for other correlation 
based networks as well. 

1 Introduction



The network approach to complex systems has turned out to be extremely 
fruitful in revealing their structure and function. The usual way to
construct the network is to identify the elements of the system with nodes,
between which links are present if the corresponding interactions exist.
In the case of weighted networks, the weight of a link is identified with the 
strength of the interaction. 

Processes taking place in a complex system, represented as a network, depend 
heavily on its structure. For example motifs that are statistically significantly 
overrepresented as compared to a random reference system are supposed to 
have some functional role 0, 0|. Moreover, communities i.e. groups that are 
well wired internally but loosely connected to the rest of the network, play 
an eminent role in dynamic phenomena like spreading a a a. Clearly, the 
investigation of the network structure is of central interest. 

For many systems, however, the nature of interactions is hidden and only some 
activities of the nodes can be measured, e.g., in the form of time series. For 
such systems, the natural network representation is a complete graph with 
weights corresponding to the elements of the correlation matrix determined 
by the nodal activities. Then the task is to filter out from the noisy correlation 
matrix the groups of closely related elements. This problem is quite general 
and it appears in many fields of research ranging from the evaluation of micro- 
array data to portfolio optimization. In this paper we have chosen to study 
correlation matrices of stock returns, but we think that the network approach 
and the observations made here have also more general validity. 

Correlations between time-series of stock returns serve as one of the main 
inputs in the portfolio optimization theory. In the classical Markowitz portfolio
optimization the correlations are used as measures of the dependence between
the time series and the variance as the measure of risk [10] (see footnote 1). As the
empirical time series are always finite, the resulting correlation matrix is noisy. This
brings up the need to reduce the noise, for which the most frequently applied
tool in the financial literature is principal component analysis [12].



Previously, correlation matrices of stock return time series have been studied 
from the network point of view, e.g., by using maximal spanning trees (MST). 
The maximal spanning tree of a network is a tree containing all the N nodes 
and $N - 1$ links such that the sum of the weights is maximized. It was
introduced in the study of financial correlation matrices by Mantegna [13], who
was able to identify groups of stocks that make sense from an economic point
of view. It was discovered that the branches of the MST often correspond to
business sectors or industries. Moreover, this method makes it possible to describe the
hierarchical organization of the market and has been applied to monitor the
effect of the time dependence of the correlations [IT]]. Later, MSTs of diverse
financial correlation based networks have been studied, e.g., by Bonanno et
al. [14, 15, 16] and Onnela et al. [17, 18]. Indeed, the MST method is simple
and gives reasonable results. However, it is too restrictive and thus other,
complementary methods are needed.

1 There are conceptual problems with this approach, which we do not want to
address here [11].
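The definition of the maximal spanning tree given above can be illustrated with a minimal sketch. It assumes Kruskal's algorithm on links sorted by decreasing weight; this is one standard way to build such a tree, not necessarily the procedure used in the cited works. The function name and the toy matrix `C` are hypothetical.

```python
def maximal_spanning_tree(weights):
    """Return the links (i, j, w) of a maximal spanning tree.

    `weights` is a symmetric N x N list of lists; the diagonal is ignored.
    Kruskal's algorithm is applied to links sorted by decreasing weight.
    """
    n = len(weights)
    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    links = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    tree = []
    for w, i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:  # the link joins two components, so no cycle is created
            parent[ri] = rj
            tree.append((i, j, w))
            if len(tree) == n - 1:
                break
    return tree


# Hypothetical toy matrix: nodes 0-1 and 2-3 form two tightly coupled pairs.
C = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
mst = maximal_spanning_tree(C)
```

In this toy case the tree keeps the two strong intra-pair links and a single bridge between the pairs, which also shows why the method is restrictive: exactly $N - 1$ links survive, whatever the data look like.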



In the so called asset graph approach, one ranks the links according to the
values of the corresponding correlation matrix and considers only a fraction
$p$ of the strongest ones as occupied. By using this method for low values of $p$,
Onnela et al. [19] and Heimo et al. [20] found clear evidence of strong
intra-business sector clustering. It has also been suggested that planar maximally
filtered graphs yield a natural extension to the MST approach [21]. Other
interesting approaches include methods based on the super-paramagnetic Potts
model [22] and on the maximum likelihood optimization [23]. Several financial
markets have been studied from the above points of view.
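The asset graph construction described above can be sketched as follows. The function names, the rounding of the number of retained links, and the toy matrix are assumptions made for illustration only.

```python
def asset_graph(corr, p):
    """Return the fraction p of links with the largest correlations."""
    n = len(corr)
    links = sorted(
        ((corr[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    n_keep = round(p * len(links))  # assumption: round to the nearest link count
    return [(i, j) for _, i, j in links[:n_keep]]


def connected_components(n, links):
    """Group nodes 0..n-1 into the connected components defined by links."""
    neighbours = {i: set() for i in range(n)}
    for i, j in links:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbours[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components


# Hypothetical toy matrix with two tightly coupled pairs of nodes.
C = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
sparse_links = asset_graph(C, 1 / 3)
sparse_clusters = connected_components(len(C), sparse_links)
full_clusters = connected_components(len(C), asset_graph(C, 1.0))
```

For small $p$ the two pairs appear as separate clusters, while for $p = 1$ everything merges into one component; scanning $p$ between these limits is the kind of analysis the asset graph approach relies on.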

Based on these approaches, the following picture of the organization of the
correlation network of the stocks emerges: i) There is a dominant correlation 
among most of the stocks reflecting the overall behavior of the market (this 
is the basis of the one factor model HI); ii) The stocks are organized hi- 
erarchically in clusters, which mainly correspond to industrial branches (as 
assumed in the multi-factor models) [H| iii) There are systematic deviations 
from this oversimplified picture, partly because of the ambiguous nature of 
any classification scheme and partly because of inter-cluster relations; iv) In 
spite of considerable robustness in the correlations during the time of "business 
as usual", major events like crashes cause dramatic changes in the network 
structure \Y] 



All information on the network structure is encoded in its adjacency matrix, 
or, for weighted networks, in the weight matrix. Likewise, all information on 
the structure of correlations of stock returns is to be found in the correlation 
matrix. Consequently, this information is also inherited by the eigenvalues and 
eigenvectors of such matrices. If the data is structured in terms of clusters and 
communities that these matrices represent, it should also be reflected in the 
eigenpairs. For financial correlation matrices, it has been shown that most 
eigenpairs correspond to noise accessible by random matrix theory and thus 
the information about the cluster structure is contained in a few non-random
eigenpairs [24, 25, 26] (see [27] for an overview). It was also suggested that
clusters of highly correlated stocks could be identified by studying the
localization of non-random eigenvectors. In this paper we investigate the question
of how the eigenpairs of correlation and related matrices derived from stock 
price time series reflect the cluster structure and industry sectors. 

This paper is organized as follows. Section 2 gives a short introduction to the
spectral properties of weight and diffusion matrices and their relationship to 
the cluster structure of the underlying network. Section 3 is devoted to the 
analysis of the largest eigenvalue and the corresponding eigenvector, whereas 
the intermediate, non-random eigenpairs are examined in Section 4. The asset 
graph approach to the clustering of stocks is discussed in Section 5. 



2 Basic notions 



2.1 Matrices related to weighted networks 



A simple undirected and weighted network can be represented by a weight
matrix $W$ in which an element $W_{ij} = W_{ji}$ ($\geq 0$ in our study) corresponds to
the weight of the link between the nodes $i$ and $j$ and the diagonal elements are
zero. Note that $W_{ij} = 0$ signifies the absence of the link. The sum $s_j = \sum_i W_{ij}$
is the strength of node $j$ [28]. Here, we restrict our analysis to irreducible
networks, i.e., networks consisting of just one connected component. In this case
the Frobenius-Perron theorem states that $W$ has a largest positive eigenvalue
and the components of the corresponding eigenvector are non-zero and of the
same sign (see footnote 2).

If the elements of $W$ are i.i.d. random numbers with finite variance $\sigma^2$, the
probability density of the first eigenvalue converges to a normal distribution [30]:

$$\lim_{N \to \infty} \mathrm{distr}\left\{ \lambda_1 - \left[ (N - 1)\mu + \sigma^2/\mu \right] \right\} = \mathcal{N}(0, 2\sigma^2), \qquad (1)$$

where $\mu > 0$ is the mean of the matrix elements. The first term inside the
square brackets expresses the average node strength, while the second term is 
due to fluctuations. In some cases statements can be made about the whole 
spectrum. We return to this in section 2.3. 

2 In network analysis, these components have been interpreted as measures of
centrality of the corresponding nodes (see, e.g., [1, 3]). Here the idea is that the
centrality $x_i$ of node $i$ should be proportional to the average of the centralities of
its neighbours, weighted by the weights of the connecting links. This leads to the
equation $x_i = \frac{1}{\lambda} \sum_j W_{ij} x_j$, where $\lambda$ is a constant. In matrix form, $W x = \lambda x$, and
with the restriction $x_i > 0$ the only non-trivial solution is the eigenvector
corresponding to the largest eigenvalue. This measure of centrality is often referred to as
the eigenvector centrality.

A diffusion process in terms of random walks can be used as a tool for studying
the structure of a network [31, 32, 33]. At each time step a walker moves at
random from its current node $j$ to node $i$ with probability $T_{ij} = W_{ij}/s_j$. If
we denote the walker density at node $i$ by $v_i(t)$, the average dynamics of the
process is described by

$$v(t + 1) = T v(t) \qquad (2)$$

or equivalently by

$$v(t + 1) - v(t) = D v(t), \qquad (3)$$

where $v(t) = (v_1(t), \ldots, v_N(t))^T$ and $D = T - I$. Here, $T$ denotes the transfer
matrix and $D$ the diffusion matrix. These matrices clearly have the same
eigenvectors and the spectrum of $D$ is identical with the spectrum of $T$ shifted
to the left by unity. The matrix $T$ can be mapped into a symmetric matrix
by the similarity transformation $\mathrm{diag}(s^{-1/2}) \cdot T \cdot \mathrm{diag}(s^{1/2})$ and therefore its
eigenvalues are real. Furthermore, the walker density cannot diverge at any
node, so the eigenvalues of $T$ must lie within the interval $[-1, 1]$. The strength
vector $s = (s_1, \ldots, s_N)$ is an eigenvector of $T$ with eigenvalue 1 and due to
the Frobenius-Perron theorem this eigenvalue is non-degenerate. Thus

$$\lim_{t \to \infty} v(t) = s, \qquad (4)$$

unless the network is bipartite, in which case $-1$ is also an eigenvalue.
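The convergence of the walker density to the strength vector can be checked numerically. The following is a minimal sketch on a hypothetical four-node weight matrix; the total walker density is chosen equal to the total strength so that the limit is exactly $s$.

```python
def strengths(W):
    """Node strengths s_j = sum_i W_ij."""
    n = len(W)
    return [sum(W[i][j] for i in range(n)) for j in range(n)]


def transfer_matrix(W):
    """Transfer matrix T_ij = W_ij / s_j; every column sums to one."""
    s = strengths(W)
    n = len(W)
    return [[W[i][j] / s[j] for j in range(n)] for i in range(n)]


def step(T, v):
    """One step of the average dynamics v(t + 1) = T v(t)."""
    n = len(v)
    return [sum(T[i][j] * v[j] for j in range(n)) for i in range(n)]


# Hypothetical weight matrix with two modules, {0, 1} and {2, 3}.
W = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.3, 0.2],
    [0.2, 0.3, 0.0, 0.8],
    [0.1, 0.2, 0.8, 0.0],
]
s = strengths(W)
T = transfer_matrix(W)

v = [sum(s), 0.0, 0.0, 0.0]  # all walkers start on node 0
for _ in range(200):
    v = step(T, v)
```

After the iteration `v` agrees with `s` to numerical precision, because the toy network is connected and not bipartite.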



In addition to walker densities $v_i(t)$, the diffusion process can also be analyzed
by studying the walker densities per unit strength defined by

$$c_i(t) = \frac{v_i(t)}{s_i}. \qquad (5)$$

It is straightforward to show that

$$c(t + 1) = N c(t), \qquad (6)$$

where $N = T^T$ is called the normal matrix. Clearly the only difference between
the governing equations for the densities and the densities per unit strength
is that $T$ is replaced with $N$. From Eq. (4) we see that

$$\lim_{t \to \infty} c(t) = (1, \ldots, 1)^T. \qquad (7)$$
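For completeness, the step from the walker densities to the densities per unit strength can be written out. The derivation below is a sketch that uses only the definitions $T_{ij} = W_{ij}/s_j$ and $c_i(t) = v_i(t)/s_i$ together with the symmetry $W_{ij} = W_{ji}$:

```latex
\begin{align*}
c_i(t+1) &= \frac{v_i(t+1)}{s_i}
          = \frac{1}{s_i} \sum_j T_{ij}\, v_j(t)
          = \frac{1}{s_i} \sum_j \frac{W_{ij}}{s_j}\, s_j\, c_j(t) \\
         &= \sum_j \frac{W_{ji}}{s_i}\, c_j(t)
          = \sum_j T_{ji}\, c_j(t)
          = \sum_j \left[ T^T \right]_{ij} c_j(t).
\end{align*}
```

The matrix governing the densities per unit strength is therefore the transpose of the transfer matrix, and its rows, rather than its columns, sum to one.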



2.2 Modular structure and eigenvectors 



Recently there has been increasing interest in the "mesoscopic" properties 
of networks, i.e., in structures beyond the scale of single vertices or their 
immediate neighborhoods. One important related problem is the detection 



and characterization of modules or communities 0, S, 34, 35, 36, 37 , which 
are, loosely speaking, groups of vertices with dense internal connections and 
weaker connections to the rest of the network. 
Evidently, the weight, transfer, normal, and diffusion matrix representations
of a modular network carry information about the modules in their eigenvalues
and eigenvectors. In the case of diffusion, it is tempting to assign to the eigenvalues
and eigenvectors a direct physical interpretation [31, 32, 33]. If a random walker
enters a module with dense internal connections and sparse connections to the
rest of the network, it gets, on average, "trapped" for a long time. This
phenomenon is reflected in the spectral expansion of the random walker density
at time $t$,

$$v(t) = \sum_{j=1}^{N} c_j \lambda_j^t e_j, \qquad (8)$$

$$c = E^{-1} v(0), \qquad (9)$$

where $E$ contains the eigenvectors $e_j$ of the transfer matrix in its columns
and $\lambda_j$ are the corresponding eigenvalues. If
convergence to the stationary state is slow, Eq. (8) should contain some terms 
with eigenvalues close to 1 or —1. Eigenvalues close to —1 indicate that the 
network is almost bipartite. On the other hand, large positive eigenvalues are 
consequences of modular structure and the corresponding eigenvectors can be 
expected to carry information about the structure of communities. 

The interpretation of the eigenpairs of the weight matrix is more difficult.
However, one can naturally iterate a vector $v$ (a "phantom field" on the nodes
of the network) by the weight matrix, and study its properties. Eq. (8) still
applies, and Eq. (9) can be written in a simpler form $c_j = e_j \cdot v(0)$ due to the
symmetry of the weight matrix (of course, $e_j$ are now eigenvectors of the weight
matrix). Here a convenient initial condition is $v_j(0) = \delta_{ij}$, and as
$v_a(t + 1) = \sum_j W_{aj} v_j(t)$, the new value of this quantity on node $a$ will be a weighted sum of
the (old) values on $a$'s neighbours. If the spreading of this quantity starts from
node $i$, located in a densely interconnected module, during the first time steps
only the other members of this module get significant contribution, as $i$ and
its neighbours have most of their links within the module. The phenomenon
resembles the "trap"-behaviour of the modules in the case of diffusion. Noticing
that (see footnote 3)

$$\lim_{t \to \infty} \frac{v(t)}{\lambda_1^t} = c_1 e_1, \qquad (10)$$

where $\lambda_1$ is the largest eigenvalue of $W$, we see that the ratios $v_i/v_j$ approach
constants. The speed of the convergence depends naturally on the magnitudes
of the other eigenvalues.

The fact that eigenvalues close to the largest one slow down the convergence
suggests that the corresponding eigenvectors can carry information about the
modules, similarly to the eigenvectors of the diffusion matrix. In conclusion,
it seems to be reasonable to interpret the eigenvectors of the weight matrix
similarly to the eigenvectors of the diffusion matrix, at least from the point of
view of network modularity.

3 Here, we must assume that $v(0)$ is not perpendicular to $e_1$.

Now, let us turn briefly to the interpretation of eigenvector components, both
for diffusion and weight matrices. Consider a case in which the eigenvectors
are ranked in descending order of eigenvalue and $\lambda_2 \gg |\lambda_{3 \ldots N}| \approx 0$. Assume that the
second eigenvector is localized on two sets of nodes, such that the eigenvector
components corresponding to the first set are positive, and the components
corresponding to the second set are negative. Then, the second term in Eq.
(8) gives a slowly decaying correction for both sets of nodes, but with different
signs. This means that the random walker or the "phantom field" gets trapped
for a while in one set, and is held back from entering into the other set. So
the first set of nodes can be thought of as a community. Changing the initial
condition such that $c_2$ changes its sign, and applying the above arguments
shows that the other set can also be thought of as a community. Both of these
cases show that the two communities are far from each other with regard to the
average travelling time of a random walker between them. It should be noted
here that using absolute values or squares of the eigenvector components (e.g.,
[38, 39]) is clearly inappropriate, as an eigenvector may be localized on two
extremely distant communities.

As mentioned in the previous section the diffusion process can also be analyzed
by studying the time evolution of the walker densities per unit strength $c_i(t)$.
Simonsen et al. have suggested that the eigenvectors of the normal matrix $N$
corresponding to the largest eigenvalues contain a lot of information about the
modular structure of the network and that the modules can be identified with
the so called current mapping technique [31, 32, 33]. Here, one should notice
that the $i$th component of the $k$th eigenvector of $N$ is equal to $[e_k]_i / s_i$, where
$[e_k]_i$ is the $i$th component of the $k$th eigenvector of the transfer matrix $T$.



2.3 Correlation Matrices 



The equal time correlation matrix $C$ of $N$ variables can be estimated from $T$
observations by

$$C_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left[ \langle r_i^2 \rangle - \langle r_i \rangle^2 \right] \left[ \langle r_j^2 \rangle - \langle r_j \rangle^2 \right]}}, \qquad (11)$$

where $r_i$ is a vector containing the observations of the variable $i$. In the case of
Gaussian i.i.d. variables, $C$ is the Wishart matrix and its eigenvalue density
converges as $N \to \infty$, $T \to \infty$, while $N/T < 1$ is fixed, to

$$\rho_W(\lambda) = \begin{cases} \dfrac{T/N}{2\pi\sigma^2} \dfrac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda}, & \lambda_{\min} \leq \lambda \leq \lambda_{\max}, \\ 0, & \text{else}, \end{cases} \qquad (12)$$

$$\lambda_{\max/\min} = \sigma^2 \left( 1 \pm \sqrt{N/T} \right)^2, \qquad (13)$$

where $\sigma^2 = 1$ due to the "normalization" in Eq. (11) (see footnote 4). In empirical cases,
significant deviations from Eq. (12) can usually be considered as signs of relevant
information [26].
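To make the role of the spectral bounds concrete, the following minimal sketch evaluates the band edges of Eq. (13) and the density of Eq. (12). It assumes the standard Marchenko-Pastur form of the Wishart density; the function and parameter names are hypothetical.

```python
import math


def wishart_bounds(n_vars, n_obs, sigma2=1.0):
    """Edges (lambda_min, lambda_max) of the Wishart eigenvalue density."""
    q = math.sqrt(n_vars / n_obs)
    return sigma2 * (1.0 - q) ** 2, sigma2 * (1.0 + q) ** 2


def wishart_density(lam, n_vars, n_obs, sigma2=1.0):
    """Eigenvalue density for N/T < 1; zero outside the band edges."""
    lam_min, lam_max = wishart_bounds(n_vars, n_obs, sigma2)
    if not lam_min <= lam <= lam_max:
        return 0.0
    prefactor = (n_obs / n_vars) / (2.0 * math.pi * sigma2)
    return prefactor * math.sqrt((lam_max - lam) * (lam - lam_min)) / lam


# Example: 100 variables observed 400 times, so that sqrt(N/T) = 0.5.
lam_min, lam_max = wishart_bounds(100, 400)
```

Eigenvalues of an empirical correlation matrix that lie clearly above `lam_max` are the candidates for carrying information beyond noise.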



A correlation matrix can be transformed to the weight matrix of a simple
undirected and weighted network by

$$W_{ij} = |C_{ij}| - \delta_{ij}. \qquad (14)$$

From the point of view of network theory, the transformation can be justified
by interpreting the absolute values as measures of interaction strength without
considering whether the interaction is positive or negative. If the elements of
$C$ are non-negative, $W = C - I$. In that case the transformation does not
change the eigenvectors but the eigenvalues are shifted to the left by unity.
The correlation matrices studied in this paper, however, contain a few slightly
negative elements. Fortunately, taking the absolute value of the elements
does not change the numerical values of the spectral quantities significantly.
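The chain from time series to weight matrix can be sketched in a few lines. The example assumes the Pearson estimator for the correlation matrix and uses hypothetical return series chosen so that the result is easy to verify by hand.

```python
import math


def correlation_matrix(series):
    """Pearson correlation matrix of equally long time series."""
    n_obs = len(series[0])
    means = [sum(r) / n_obs for r in series]
    centred = [[x - m for x in r] for r, m in zip(series, means)]
    norms = [math.sqrt(sum(x * x for x in r)) for r in centred]
    n = len(series)
    return [
        [
            sum(a * b for a, b in zip(centred[i], centred[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def weight_matrix(corr):
    """W_ij = |C_ij| - delta_ij: absolute correlations with a zero diagonal."""
    n = len(corr)
    return [
        [abs(corr[i][j]) - (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical returns: series 1 is twice series 0, series 2 is its mirror image.
returns = [
    [0.01, -0.02, 0.03, -0.01],
    [0.02, -0.04, 0.06, -0.02],
    [-0.01, 0.02, -0.03, 0.01],
]
C = correlation_matrix(returns)
W = weight_matrix(C)
node_strengths = [sum(row) for row in W]
```

The perfectly anticorrelated third series receives the same weights as the perfectly correlated second one, which is exactly the effect of taking absolute values.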

In the following sections, we study correlation matrices constructed from the 
logarithmic returns of New York Stock Exchange (NYSE) traded stocks. We 
use two different data sets. The larger one consists of the daily closing prices 
of 476 stocks and ranges from 2-Jan-1980 to 31-Dec-1999. In the smaller data 
set the number of stocks is 116 and the time window ranges from 13-Jan-1997 
to 29-Jan-2000. With both data sets the length of the time series T is not very 
large compared to the number of stocks N. Therefore the correlation matrices 
are noisy. 



3 First eigenpair 



The largest eigenvalue of a correlation matrix derived from stock return time
series is always clearly separated from the rest of the spectrum. The
corresponding eigenvector is typically interpreted to be representative of the whole
market [26], and is usually called the market eigenvector.



3.1 Approximations of the first eigenvector



Perhaps the simplest way to approximate the first eigenvector of the weight
matrix is to iterate a vector that is not perpendicular to the first eigenvector by
the weight matrix (see Eq. (10)). From the Frobenius-Perron theorem we know
that the components of the first eigenvector have the same sign, so a natural
choice for the initial vector is one with uniform components. The first iteration
of this vector yields a vector proportional to the strength vector $s$. Since $s$ is
the first eigenvector of the diffusion matrix derived from the weight matrix,
we see that the first eigenvectors of the weight and the diffusion matrices
are the same after the first iteration. Of course, one can find examples where this
approximation is far from the asymptotics.

4 Without the normalization, $\sigma^2$ would be the variance of the variables.
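The iterative approximation can be sketched with a plain power iteration. The weight matrix below is a hypothetical four-node example and the fixed number of iterations is an assumption; no convergence test is performed.

```python
import math


def matvec(W, v):
    """Matrix-vector product W v for a list-of-lists matrix."""
    n = len(v)
    return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def first_eigenvector(W, n_iter=500):
    """Power iteration started from the uniform vector."""
    v = normalize([1.0] * len(W))
    for _ in range(n_iter):
        v = normalize(matvec(W, v))
    return v


# Hypothetical weight matrix with two modules, {0, 1} and {2, 3}.
W = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.3, 0.2],
    [0.2, 0.3, 0.0, 0.8],
    [0.1, 0.2, 0.8, 0.0],
]
strength_direction = normalize([sum(row) for row in W])
one_step = normalize(matvec(W, [1.0] * len(W)))  # equals strength_direction
e1 = first_eigenvector(W)
lam1 = sum(a * b for a, b in zip(e1, matvec(W, e1)))  # Rayleigh quotient
```

Comparing `one_step` with `e1` shows how far the strength vector is from the converged eigenvector for a given matrix.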

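Another simple way to approximate the first eigenvector of the weight matrix
is a perturbation-based calculation. In [39], such an approach was presented,
although with a different definition of perturbation. Here we separate the
empirical weight matrix into two terms as

$$W_{ij} = (1 - \delta_{ij}) w_0 + V_{ij}, \qquad (15)$$

where $w_0$ is the average off-diagonal element of the weight matrix and $V_{ij}$ is
the deviation considered as perturbation. The eigenvalues of the unperturbed
matrix are

$$\lambda_1^{(0)} = (N - 1) w_0, \qquad (16)$$

$$\lambda_{2 \ldots N}^{(0)} = -w_0, \qquad (17)$$

and the first eigenvector is

$$e_1^{(0)} = \frac{1}{\sqrt{N}} (1, \ldots, 1)^T. \qquad (18)$$

The first order correction of the first eigenvalue reads

$$\lambda_1^{(1)} = e_1^{(0)T} V e_1^{(0)} = \frac{1}{N} \sum_{ij} V_{ij} = 0, \qquad (19)$$

and thus $\lambda_1 \approx (N - 1) w_0$, assuming that the perturbation expansion converges
fast. The first order correction of the first eigenvector is

$$e_1^{(1)} = \sum_{k=2}^{N} \frac{e_k^{(0)T} V e_1^{(0)}}{\lambda_1^{(0)} - \lambda_k^{(0)}} e_k^{(0)} = \frac{1}{N w_0} \sum_{k=2}^{N} e_k^{(0)} e_k^{(0)T} V e_1^{(0)} = \frac{1}{N w_0} V e_1^{(0)}, \qquad (20)$$

where the last step uses Eq. (19), and the $i$th component of the corresponding
first order approximation is

$$\left[ e_1^{(0)} + e_1^{(1)} \right]_i = \frac{1}{\sqrt{N}} \left( 1 + \frac{s_i - (N - 1) w_0}{N w_0} \right) \approx \frac{s_i}{N w_0 \sqrt{N}} \approx \frac{s_i}{\bar{s} \sqrt{N}}, \qquad (21)$$

where $N \gg 1$ has been assumed and $\bar{s} = (N - 1) w_0$ is the mean strength.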

Thus the result is similar to the one obtained by the iterative way: in first
order perturbation theory, the components of the first eigenvector are
proportional to the corresponding strengths. This proportionality was also pointed
out in Ref. [39]. We add to this observation that, provided the first order
approximation is sufficient, the average off-diagonal matrix element is
proportional to the largest eigenvalue.
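The quality of the first order results can be probed numerically. The sketch below uses a hypothetical, deterministic test matrix (a uniform background with a small symmetric ripple), not the empirical data, and compares the largest eigenpair obtained by power iteration with the approximations $\lambda_1 \approx (N - 1) w_0$ and $[e_1]_i \approx s_i / (\bar{s} \sqrt{N})$.

```python
import math

N = 50
# Hypothetical test matrix: uniform background plus a small symmetric ripple.
W = [
    [0.0 if i == j else 0.3 + 0.05 * math.sin(i + j) for j in range(N)]
    for i in range(N)
]

strengths = [sum(row) for row in W]
mean_strength = sum(strengths) / N
w0 = mean_strength / (N - 1)  # average off-diagonal element


def power_iteration(matrix, n_iter=300):
    """Largest eigenvalue and eigenvector of a symmetric non-negative matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))  # |W v| tends to lambda_1
        v = [x / lam for x in w]
    return lam, v


lam1, e1 = power_iteration(W)
approx = [s / (mean_strength * math.sqrt(N)) for s in strengths]

# Relative errors of the first order approximations.
eigenvalue_error = abs(lam1 - (N - 1) * w0) / lam1
eigenvector_error = max(abs(a - b) / b for a, b in zip(approx, e1))
```

For this nearly uniform matrix both relative errors are well below one percent; for matrices with a strong modular structure the deviations can be expected to be larger.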



3.2 First eigenpair of financial correlation based networks 

Based on the previous section, it is not surprising that the largest eigenvalue
$\lambda_1$ is strongly correlated with the mean correlation coefficient in both data
sets used. This is illustrated in Fig. 1. Deviations from the zeroth order
approximation are illustrated in Fig. 2.




Figure 1. (color online) The first eigenvalues (solid line) and rescaled mean correla- 
tions (dashed line) as functions of time for the 116-stocks database (on the left) and 
for the 476-stocks database (on the right). The correlation matrices were constructed 
from 1000 trading days in both cases. The outstanding plateau is a consequence of 
Black Monday, a large market crash on October 19, 1987. 





Figure 2. Difference between the first eigenvalue and its zeroth order approximation 
for the 116-stocks dataset (on the left) and for the 476-stocks dataset (on the right). 
A window of 1000 trading days has been used. 

Figure 3. (color online) Components of the first eigenvector ($\triangle$) and the normalized
strengths ($\circ$) (the larger data set is used). The ordering of the stocks is such that
stocks belonging to the same business sector according to the Yahoo classification [41] are
next to each other.

As expected, the eigenvector corresponding to the largest eigenvalue is well
approximated by the strength vector. This is illustrated in Fig. 3. A further
observation is that the relative differences of the components of the first
eigenvector and the (normalized) strengths are positively correlated with the
strengths (Fig. 4, panel a). In order to understand this effect, we have
constructed and numerically analyzed several kinds of correlation matrices. Our
observations are as follows. The above correlation does not exist for random
matrices, in which the elements are i.i.d. random variables from the uniform
distribution (Fig. 4, panel b). Surprisingly, the correlation is negative for the
one factor model with the same mean correlation, when the correlation matrix
is constructed from finite time segments of uncorrelated Gaussian time
series (Fig. 4, panel c). A strength distribution with non-vanishing width
produces a similar effect. However, for multi-block weight matrices with an
artificial modular structure together with additional noise (see footnote 5), a
similar correlation is found (Fig. 4, panel d). Hence, the observed correlation
could be attributed to the presence of modular structure in the weight matrix.

There is another interesting feature in Fig. 4 worth noting. The "outliers" in 
the lower left corner of panel a in Fig. 4 correspond to companies related to 
gold and silver mining, which are known to be extremely weakly correlated 
(or even negatively correlated) with the other participants of the market. 



5 The matrices were constructed by $W_{ij} = W^0_{ij} + 0.1 \cdot \eta_i \eta_j r$, where the communities
are represented by matrix $W^0$ containing ten blocks of size $45 \times 45$ on the diagonal,
$\eta_i = |s + 1|$ is a random parameter for each node, $s$ is drawn from the standard
normal distribution, and $r$ is drawn from the standard uniform distribution. $W$
was normalized such that the mean element was equal to the mean element of the
empirical weight matrix.

Figure 4. The relative differences of the components of the first eigenvector and
the (normalized) strength vector as functions of the (normalized) strengths. a) The
empirical data set ($N = 476$ stocks), b) random matrices with i.i.d. elements from
the uniform distribution, c) correlation matrices of the one factor model with the
same length of time series and the same mean correlation as the empirical matrix,
d) artificial multiblock correlation matrices. All results for artificial matrices are
averages over 1000 runs.

4 Intermediate eigenpairs 

In this section we analyze the intermediate eigenvectors of the empirical weight
and diffusion matrices (see footnote 6). We start by discussing the problems related to
defining the information carrying eigenvectors of the weight matrix and continue
by studying how the cluster structure of the network is reflected in the
localization of these eigenvectors. Lastly, we analyze the intermediate eigenvectors
of the diffusion matrix and find the highest ranking ones to be very close
to those of the weight matrix. We also demonstrate the use of the current
mapping technique with our data set.

6 In this section, only the larger data set is studied.

4.1 Defining the intermediate eigenpairs of the weight matrix



The highest ranking eigenpairs of the correlation matrix constructed from
stock return time series are far from being random [24, 25], but the randomness
increases rapidly together with increasing rank (on average) [20]. Therefore,
there is no strict border between the random and intermediate parts of the
spectrum and the identification of the information carrying eigenvectors is a
highly non-trivial task.

Fig. 5 depicts the spectrum of the weight matrix together with the analytical
results for Wishart matrices (Eq. (12)) (see footnote 7). The analytical curve is fitted
by visual inspection using $\sigma^2$, i.e. the variance of the effectively random part
of the correlation matrix, as an adjustable parameter. Best fit is obtained
with $\sigma \approx 0.86$, which, substituted into Eq. (13), yields $\lambda_{\max} \approx 0.3$. However,
many eigenvectors corresponding to eigenvalues above this bound are to a
large extent random and on the other hand, some below this bound contain
information [43]. Therefore, $\lambda_{\max}$ can only be considered as a suggestive
indicator of the crossover region between the random and intermediate parts of
the spectrum.




Figure 5. Left: The spectral density of the weight matrix. The inset shows the random 
bulk and the analytical curve. Right: The IPRs of the eigenvectors as a function of 
the corresponding eigenvalue. 



7 Note that the spectrum is shifted to the left by 1, due to Eq. (14). In [42] an
improved fit is suggested based on the random matrix theory of power law distributed
variables. However, the minor difference in the fitting is irrelevant from our point
of view.

Plerou et al. [25, 26] have suggested the use of inverse participation ratios
(IPR), defined for vector $v$ as

$$I(v) = \sum_{i=1}^{N} v_i^4, \qquad (22)$$

in the identification of the information carrying eigenvectors. The idea behind
this is that the more localized the eigenvector is, the higher is its IPR. From
the right panel of Fig. 5, which depicts $I(v)$ for the eigenvectors of the weight
matrix as a function of the corresponding eigenvalue, we see that most of
the random and intermediate eigenvectors have similar IPRs (see also [43]).
Thus IPR does not seem to be an efficient tool to distinguish the information
carrying eigenvectors from the rest (see footnote 8). More sophisticated analysis is needed.
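The behaviour of the inverse participation ratio in the two limiting cases can be checked with a short sketch. The function below assumes the fourth-power definition for a vector of unit length and normalizes its input explicitly; the example vectors are hypothetical.

```python
def inverse_participation_ratio(v):
    """Sum of fourth powers of the components of v, after normalizing v."""
    norm2 = sum(x * x for x in v)
    return sum(x ** 4 for x in v) / norm2 ** 2


n = 100
uniform = [1.0] * n                  # spread evenly over all nodes
single = [1.0] + [0.0] * (n - 1)     # localized on one node
pair = [1.0, -1.0] + [0.0] * (n - 2) # localized on two nodes, opposite signs

ipr_uniform = inverse_participation_ratio(uniform)  # 1 / n
ipr_single = inverse_participation_ratio(single)    # 1
ipr_pair = inverse_participation_ratio(pair)        # 1 / 2
```

The reciprocal of the IPR thus gives a rough count of the components that contribute to the vector. Since the signs of the components do not enter, the IPR alone cannot tell whether a vector is localized on one group of nodes or on two distinct ones.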



4.2 Localization of the eigenvectors



We have seen that the gradual increase of the noise content already makes the
identification of the clusters in the network a difficult task. Here we will go
into the further difficulties caused by the complexity of the localization of the
information carrying eigenvectors. Financial correlations are particularly
appropriate to investigate this point as independent classification schemes exist
to compare with. In the following we will take advantage of this information
in the example-like analysis of a couple of interesting eigenvectors. The
components of the eigenvectors studied are illustrated in Fig. 6, in which the
(horizontal) ordering of the stocks is such that stocks belonging to the same
business sector according to the Yahoo classification [41] are next to each other. This
makes the eigenvectors localized on a business sector stand out more clearly.

The highest ranking intermediate eigenvector, namely the second eigenvector,
is a good example of an eigenvector localized on a business sector. The com-
ponents corresponding to the utilities sector stand out very cleanly in Fig. 
6. One should notice, however, that this would not be the case without the 
chosen horizontal ordering of the companies. Without a priori information we 
would not be able to define boundaries for this cluster. 

The third eigenvector, which is mainly localized on oil and gold & silver min- 
ing companies is already a more difficult one. The other large components 
of this eigenvector correspond to Petroleum & Resources (P&R), a financial 
company specialized in the energy sector, Tidewater (Tidew), which provides 
vessels and services for the offshore energy industry, and ASA Ltd. (ASA), 
an investment company interested in precious metal mining. The thresholding 
analysis (not presented here) shows that the companies corresponding to the 
largest components of this eigenvector form two clusters. The third eigenvector 
is not the only example of a high ranking intermediate eigenvector localized on 
more than one cluster. The sixth eigenvector, for example, is localized on gold 
& silver mining, leading electronics manufacturers & electronics stores, and 



The high IPRs of the lowest ranking eigenvectors are due to the well known fact 
that they are localized to pairs of stocks with the very highest correlation coefficients 

2I, H0. 
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Figure 6. Component sizes of chosen eigenvectors. The number in each panel in- 
dicates the rank of the corresponding eigenvalue. Horizontal ordering is such that 
stocks belonging to same business sectors are next to each other and open symbols 
are used as a guide to the eye. For abbreviations, see text. 



air transportation companies. Again, the thresholding analysis shows that all
these industries form their own clusters. Interestingly, the seventh eigenvector
is localized solely on the gold & silver mining-related companies.

One encounters further difficulties when analyzing e.g. the tenth eigenvector,
which has a very complex structure. As illustrated in Fig. 6, it is localized
on a large number of industry branches, most of which can be found by the
thresholding analysis. However, without some prior information, interpretation
of this eigenvector is impossible. On the other hand, surprisingly, the 19th
eigenvector can be straightforwardly interpreted although the corresponding
eigenvalue is close to the random part of the spectrum ($\lambda_{19} \approx 0.55$)
and the neighbouring eigenvectors are to a large extent random. This
eigenvector is strongly localized on Sony and Honda, the only Japanese
companies in the data set. It should be noted that several eigenvectors
corresponding to the lowest ranking eigenvalues are localized on pairs of
companies with the highest correlation coefficients.

To summarize, it seems evident that the cluster structure of a network cannot
be easily deduced from the eigenvectors of the weight matrix. In particular,
interpretation of a single eigenvector is even more difficult than suggested in
recent literature. Most of the information about the cluster structure can only
be found by combining information from different eigenvectors (footnote 9).
There is, however, no rule to tell which linear combination of the eigenvectors
should be taken. Therefore, the extraction of the cluster structure from the
eigenvectors without a priori knowledge about the nodes (here companies) seems
to be too formidable a task.
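The localization analysis discussed above can be sketched numerically. The following is a minimal illustration, not the procedure used for the figures: it assumes the weight matrix $W_{ij} = |C_{ij}| - \delta_{ij}$ from the abstract and the usual definition of the inverse participation ratio as the sum of the fourth powers of the components of a unit-norm eigenvector. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def weight_matrix(C):
    """W_ij = |C_ij| - delta_ij for a correlation matrix C."""
    C = np.asarray(C, dtype=float)
    return np.abs(C) - np.eye(C.shape[0])

def ranked_eigenpairs(M):
    """Eigenpairs of a symmetric matrix, sorted by decreasing eigenvalue."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def inverse_participation_ratio(vecs):
    """IPR of each unit-norm column (assumed definition: sum_i v_i^4).

    The value is 1/N for a vector spread evenly over N nodes and 1 for a
    vector localized on a single node.
    """
    return np.sum(np.asarray(vecs) ** 4, axis=0)

def largest_components(vec, names, k=5):
    """The k components of largest magnitude, with their labels."""
    idx = np.argsort(np.abs(vec))[::-1][:k]
    return [(names[i], float(vec[i])) for i in idx]

if __name__ == "__main__":
    # Toy example (hypothetical data): two groups of three "stocks".
    C = np.full((6, 6), 0.2)
    C[:3, :3] = 0.7
    C[3:, 3:] = 0.6
    np.fill_diagonal(C, 1.0)
    vals, vecs = ranked_eigenpairs(weight_matrix(C))
    print(inverse_participation_ratio(vecs))
    print(largest_components(vecs[:, 1], list("ABCDEF"), k=3))
```

Listing the largest components of an eigenvector only shows where it is localized. As argued above, deciding whether those nodes form one cluster or several still requires additional information.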



4.3 Diffusion based approach



In section 2.2 we reasoned how the cluster structure of a network should affect
the diffusion process. The spectrum of the diffusion matrix (depicted by the
solid line in Fig. 7) has, as expected, a structure similar to that of the
weight matrix. For diffusion matrices, results corresponding to Eqs. (12) and
(13) are not known, but a random reference function can be obtained numerically
by constructing diffusion matrices from random weight matrices generated with
the method presented in section 3.2 and in [42] (dashed line in Fig. 7). In
Fig. 8 we compare the eigenvectors of the weight and diffusion matrices. We see
that the highest ranking eigenvectors of the diffusion matrix are very close
to the corresponding eigenvectors (i.e. eigenvectors with similar localization)
of the weight matrix. Their distance increases when the random part of the
spectrum is approached and the correspondence between pairs of eigenvectors
loses its meaning.
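The comparison between the two sets of eigenvectors can be sketched as follows. This is an illustration under stated assumptions: the diffusion matrix is taken as $D_{ij} = W_{ij}/s_j - \delta_{ij}$ with $s_j = \sum_i W_{ij}$, as in the abstract, and "corresponding" eigenvectors are paired here simply by the largest absolute scalar product, which is a stand-in for the pairing by similar localization. The function names are hypothetical.

```python
import numpy as np

def diffusion_eigenpairs(W):
    """Eigenpairs of D_ij = W_ij / s_j - delta_ij, with s_j = sum_i W_ij.

    D is not symmetric, but W S^{-1} is similar to the symmetric matrix
    S^{-1/2} W S^{-1/2}, so the spectrum is real and eigh can be used.
    Assumes a symmetric W with non-negative entries and positive strengths.
    """
    W = np.asarray(W, dtype=float)
    s = W.sum(axis=0)                     # node strengths
    root = np.sqrt(s)
    sym = W / np.outer(root, root)        # S^{-1/2} W S^{-1/2}
    mu, U = np.linalg.eigh(sym)
    order = np.argsort(mu)[::-1]          # descending order
    mu, U = mu[order], U[:, order]
    V = root[:, None] * U                 # right eigenvectors of W S^{-1}
    V = V / np.linalg.norm(V, axis=0)     # unit-norm columns
    return mu - 1.0, V

def best_overlaps(V_diff, V_weight):
    """For each diffusion eigenvector, the largest |scalar product| with
    any weight-matrix eigenvector, and the index of that eigenvector."""
    overlaps = np.abs(V_diff.T @ V_weight)
    return overlaps.max(axis=1), overlaps.argmax(axis=1)
```

With this convention the trivial eigenvector of the diffusion matrix is proportional to the node strengths, and an overlap close to one indicates that a diffusion eigenvector and a weight-matrix eigenvector carry nearly the same information.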

The analysis of the eigenvectors of the normal matrix is again non-trivial.
Naturally, there are correlations between the components of different
eigenvectors, but it is impossible to identify clusters without a priori
information. The best results in two dimensions were obtained with the
eigenvectors of ranks two and five (see Fig. 9). Visual inspection of this plot
allows us to identify the oil & gas, utilities and gold & silver mining
clusters, although the determination of the boundaries is again difficult.



9 This was also suggested in [38], in which symmetric and antisymmetric
combinations of eigenvectors are analyzed.
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Figure 7. (Color online) Spectral density of the diffusion matrix corresponding to 
the larger dataset (solid line), and the average spectral density over a system of 1000 
random references (dashed line). The trivial eigenvalue at is not shown. 
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Figure 8. Scalar products of the eigenvectors of the diffusion matrix and the cor- 
responding eigenvectors (i.e. eigenvectors with similar localization) of the weight 
matrix. The eigenvectors of the diffusion matrix are ordered according to decreasing 
rank (x-axis). 

5 Asset graph approach to the clustering of stocks 



In this section we study the clustering of stocks using asset graphs [19].
An asset graph is constructed by ranking the non-diagonal elements of the
correlation matrix and adding links between stocks one after the other, starting
from the strongest correlation coefficient. The network thus emerging can be

10 In this section, only the smaller dataset is studied 
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Figure 9. Components of the fifth eigenvector of the normal matrix as a function of 
the components of the second eigenvector. 

characterized by a parameter p, which is the ratio of the number of added links
to the number of all possible links, $N(N-1)/2$. Asset graphs constructed using
the full correlation matrix C are illustrated in Fig. 10 for link occupation
values of p = 0.01, p = 0.03, p = 0.05, and p = 0.07. An immediate observation
is that some clusters stand out very cleanly and can already be identified by
visual inspection. These clusters correspond very well to business sectors and
industries according to Forbes classification [45]. However, we cannot expect
to find all clusters this way and for large N this approach cannot be applied. It
is clear that we need more sophisticated methods. One possibility, suggested
by Onnela et al., is to define a cluster as an isolated component and study
the evolution of these components as a function of p. Another possibility is
to apply some known community detection method for binary graphs (see
e.g. [3, 8, 35, 46, 47]) as a function of p. However, the problem with these
approaches is that there is no global threshold value of p with which we would
find (almost) all the information about the clusters. A more comprehensive
picture about the cluster structure can be obtained by studying the evolution
of each cluster separately as a function of p. Alternatively, we can also define
our asset graphs in a different way.
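The construction described above can be sketched in a few lines. This is a minimal illustration, assuming that "strongest" refers to the largest signed correlation coefficient and that clusters are identified with isolated components of more than one node. The function names are hypothetical.

```python
import numpy as np

def asset_graph_links(C, p):
    """Return the strongest fraction p of the N(N-1)/2 possible links.

    Links are (i, j) pairs with i < j, ordered from the strongest
    correlation coefficient downwards.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    iu, ju = np.triu_indices(n, k=1)       # each pair of stocks once
    order = np.argsort(C[iu, ju])[::-1]    # strongest correlation first
    m = int(round(p * len(order)))         # number of links to add
    return [(int(iu[k]), int(ju[k])) for k in order[:m]]

def components(n, links):
    """Connected components of the graph, found with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def n_clusters(n, links):
    """Number of isolated components with more than one node."""
    return sum(1 for c in components(n, links) if len(c) > 1)
```

Evaluating `n_clusters(N, asset_graph_links(C, p))` on a grid of p values gives a curve of the number of components against the link occupation. The sketch does not address the difficulty raised above, namely that no single value of p reveals all clusters.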



The largest components of the market eigenvector, mostly conglomerates and
financial companies, have significant correlations with almost all the other
companies. This leads to the phenomenon clearly seen in Fig. 10 that different
clusters in asset graphs merge mostly through nodes corresponding to these
companies (for further discussion see [39]). Therefore, it is interesting to study
asset graphs without the effect of the market eigenvector. This can be done



For illustration, the node coordinates are generated with Pajek by plotting the 
MST. 
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Figure 10. (color online) The asset graph constructed using the full correlation matrix 
C for link occupation values 1) p = 0.01, 2) p = 0.03, 3) p = 0.05 and 4) p = 0.07. 
Forbes classification has been used and companies belonging to the Energy sector
are denoted by ▲, Electric Utilities industry by ♦, Healthcare sector by T, Basic
Materials sector by ■ and Financial as well as Conglomerates sector by . Other
nodes are denoted by •.



by expanding the correlation matrix as

$$C = \sum_{i=1}^{N} \lambda_i \, |e_i\rangle\langle e_i| , \qquad (23)$$



where the eigenvalues are sorted according to decreasing rank, and constructing
the asset graphs using the matrix defined by

$$C_{-m} = \sum_{i=2}^{N} \lambda_i \, |e_i\rangle\langle e_i| . \qquad (24)$$



These are illustrated in Fig. 11 for link occupation values of p = 0.01, p = 0.03,
p = 0.05, and p = 0.07. Here the most significant difference compared to
Fig. 10 is that, since the market eigenvector is excluded, the degrees of the
nodes with the highest betweenness centralities in the MST (see Fig. 1 in [20])
are much lower and some components remain isolated for larger values of p.
However, from panel a of Fig. 14, where we illustrate as a function of p the
number of isolated components of size larger than one, as well as from Fig. 11,
we notice that the problem still stands. There is no global threshold value of
p that would reveal all the clusters.



Kim et al. [39] have approached the problem by defining what they call the
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Figure 11. (color online) The asset graph constructed using C_ m (i.e. correlation 
matrix from which the effect of the market eigenpair has been filtered out) for link 
occupation values 1) p = 0.01, 2) p = 0.03, 3) p = 0.05 and 4) p = 0.07. Nodes are 
denoted as in Fig. 10. 

group correlation matrix by

$$C_g = \sum_{i=2}^{N_g} \lambda_i \, |e_i\rangle\langle e_i| , \qquad (25)$$

where $N_g$ is used to exclude the effect of the random eigenpairs. From the
previous section we know that by choosing $N_g < N$ we lose some information,
but the idea here is to get rid of most of the noise without losing too much
information. $N_g$ can be approximated by comparing the eigenvalues to the
theoretical eigenvalue density for random correlation matrices and by studying
the localization of the eigenvectors. In the following we have used $N_g = 10$.
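The filtered matrices of Eqs. (24) and (25) differ only in the range of eigenpairs kept in the sum, so both can be obtained from one routine. The following is a minimal sketch with a hypothetical function name, in which ranks are 1-based and refer to eigenvalues sorted in descending order.

```python
import numpy as np

def spectral_filter(C, first=1, last=None):
    """Rebuild C from the eigenpairs with ranks first..last (inclusive).

    Ranks are 1-based and refer to eigenvalues in descending order, so
    first=2 drops the market eigenpair and last=N_g drops the eigenpairs
    ranked below N_g.
    """
    C = np.asarray(C, dtype=float)
    vals, vecs = np.linalg.eigh(C)          # ascending order
    order = np.argsort(vals)[::-1]          # switch to descending order
    vals, vecs = vals[order], vecs[:, order]
    last = len(vals) if last is None else last
    keep = slice(first - 1, last)
    # sum_i lambda_i |e_i><e_i| over the retained eigenpairs
    return (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T
```

Under these conventions `spectral_filter(C, first=2)` corresponds to Eq. (24) and `spectral_filter(C, first=2, last=10)` to Eq. (25) with $N_g = 10$. The filtered matrices no longer have a unit diagonal, which does not matter for asset graphs because only the off-diagonal elements are ranked.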

In Figure 12 we show asset graphs constructed using $C_g$ for link occupation
values of p = 0.01, p = 0.03, p = 0.05, and p = 0.07. We see that these graphs
are very similar to those presented in Fig. 11, i.e., to the ones constructed by
using $C_{-m}$. This is verified in Fig. 13, which shows the fraction of overlapping
links, i.e. the percentage of common links, in the studied asset graphs. The
shape of the curves turns out to be interesting. In all cases the overlap increases
very rapidly until $p \approx 0.025$. After this, the overlap decreases, indicating that
the links become more random, but as the number of links grows larger and
the fraction of "free" places decreases, the overlap starts to increase again.
As a reference, one can use the Erdős-Rényi ensemble G(m,N), which consists
of graphs of N nodes and exactly m links, each possible graph
appearing with equal probability. The overlap for G(m,N) is clearly $p^2$.
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Figure 12. (color online) The asset graph constructed using C g (i.e. correlation 
matrix from which the effects of the market and random eigenpairs has been filtered 
out) for link occupation values 1) p = 0.01, 2) p = 0.03, 3) p = 0.05 and 4) p = 0.07. 
Nodes are denoted as in Fig. 10. 
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Figure 13. The fraction of overlapping (i.e. common) links in asset graphs constructed 
from C g and C_ m (solid line), C and C_ m (dashed line), C and C g (dotted line) 
and in all three (dashdotted line). 

The overlap between asset graphs constructed from $C_g$ and $C_{-m}$ is found
to be around 93% for $p \approx 0.025$ and over 90% in the interval [0.022, 0.037].
From panel a of Fig. 14, in which we show the number of isolated components
as a function of p, one sees that this number is also at its highest in this
interval, meaning that these are the most relevant values of p when studying
the clustering. This means that we do not gain much by filtering out the
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Figure 14. The number of isolated components m as a function of p (panel a) and 
q (panel b) in asset graphs constructed from C (solid line), $C_g$ (dashed line) and
$C_{-m}$ (dotted line). For $C_g$ and $C_{-m}$ there is a sudden increase at $q_c \approx 0.1$ (panel
b). This jump, however, is not seen in panel a. (Notice that the number of links
increases as a function of p and decreases as a function of q.)

random eigenpairs before constructing the asset graphs. At the same time we
may lose some significant information about the small clusters stored in the
lowest ranking eigenpairs, as discussed in the previous section. One should
notice that cluster identification is more difficult when the full correlation
matrix is used, but the difference is not very large. When this is combined
with the fact that most information about the financial and conglomerate
companies is lost when the market eigenvector is filtered out, it is evident
that the best results are obtained by using both C and $C_{-m}$.

Kim et al. [39] have suggested that asset graphs constructed from $C_g$ have a
well-defined critical threshold $p_c$, where many isolated components merge into
one giant component. They construct the asset graphs by including all the
links that correspond to a correlation coefficient above a predetermined value
q and plot the number of isolated components as a function of q. A similar
plot for the present data set is shown in panel b of Fig. 14. At first sight
it seems that there exists a clear critical threshold $q_c \approx 0.1$ (seen as a sudden
jump in the number of components) for the asset graphs constructed from $C_g$
but no clear threshold for those constructed from the full correlation matrix.
However, it is perhaps a little misleading to speak about a critical threshold,
since the one "seen" in panel b of Fig. 14 is due to the fact that the elements of
$C_g$ are not uniformly distributed. From panel a of Fig. 14 one sees that there
is no clear threshold in any of these cases (as no sudden jumps are seen).
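The relation between the two parametrizations can be made explicit: thresholding at q produces the same graph as link occupation p(q), where p(q) is the fraction of off-diagonal elements above q. The sketch below, with hypothetical function names, computes this mapping. It is meant only to illustrate the argument, not to reproduce Fig. 14.

```python
import numpy as np

def occupation_from_threshold(C, q):
    """Fraction p of the N(N-1)/2 possible links with coefficient above q."""
    C = np.asarray(C, dtype=float)
    iu, ju = np.triu_indices(C.shape[0], k=1)
    return float(np.mean(C[iu, ju] > q))

def occupation_curve(C, thresholds):
    """p(q) evaluated on a grid of threshold values q."""
    return np.array([occupation_from_threshold(C, q) for q in thresholds])
```

If many matrix elements lie in a narrow band of values, p(q) changes steeply there, so a quantity that varies smoothly with p can appear to jump when plotted against q.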

To summarize, it seems that the noise present in the time series does not
change the cluster structure of the asset graphs, which is not very surprising
since only links corresponding to the highest correlation coefficients are
included. It also seems that there is no critical threshold $p_c$ in any of the
studied cases. Therefore, useful information may be lost, while no benefit is
gained, if the random eigenpairs are filtered out before constructing the asset
graphs.






6 Summary 



The aim of the present work was to investigate in complex systems the
relationship between the spectral properties of correlation based matrices and
the cluster structure of the related networks. The network was constructed as a
complete graph where the weights were identified with elements taken from
the correlation matrix. We have chosen to study stock market data since a large
amount of information has already been accumulated about them and their
spectral properties have also been studied in detail [2J,|25|]. Two data sets from
the NYSE were analyzed, one with fewer stocks, appropriate for visualization,
and a larger one with better statistical properties.



We started our study by analyzing the eigenvector corresponding to the largest 
eigenvalue of the weight matrix and found to a very good first approximation 
that the eigenvector components correspond to the strengths of the nodes (i.e. 
companies). There is a systematic second order correction roughly propor- 
tional to the nodal strengths, which is probably due to the modular structure 
of the network. 



The identification of the clusters using the high ranking eigenvectors turned
out to be too formidable a task. Therefore, we have chosen a different path:
using independent information, we have given interpretations to typical
eigenvectors. Our results show that there are eigenvectors which are well
localized on a few industrial branches. Surprisingly, such eigenvectors are not
always high ranking, i.e. they do not always correspond to a large eigenvalue.
On the other hand, some high ranking eigenvectors represent so many branches
that they are hardly distinguishable from the random case. Therefore, we think
that the eigenvectors are not appropriate for identifying the modules of such
networks. By using the diffusion matrix we arrived at a similar conclusion,
though it should be emphasized that there is a strong overlap between the
highest ranking eigenvectors of the weight and diffusion matrices.



Since direct network methods are known to be efficient in identifying the
hierarchical structure of correlation based networks [13j, LL7|, LHa, |2l|], we have
studied how the spectral methods can be combined with the asset graph method
based on thresholding. We have compared the asset graphs as obtained from
the noisy and denoised correlation matrices, where denoising was carried out
by using spectral information [39]. It turned out that denoising has little effect
on the clusters of asset graphs. This is because of the hierarchical structure of
the clusters and due to the fact that thresholding picks the high correlations
where the noise is expected to play a subordinate role. Surprisingly enough,
similar denoising methods seem to work efficiently when applied directly to
portfolio optimization 
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We can conclude that the identification of clusters or communities is even more 
difficult in the case of highly connected weighted networks. Spectral methods 
may lead to an overall description of the properties of complex systems but 
they do not seem to be appropriate for the classification problem without 
additional information about the nodes of the related network. 
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